HyperHELM: Hyperbolic Hierarchy Encoding for mRNA Language Modeling
van Spengler, Max, Moskalev, Artem, Mansi, Tommaso, Prakash, Mangal, Liao, Rui
Language models are increasingly applied to biological sequences like proteins and mRNA, yet their default Euclidean geometry may mismatch the hierarchical structures inherent to biological data. While hyperbolic geometry is a better-suited alternative for hierarchical data, it has yet to be applied to language modeling of mRNA sequences. In this work, we introduce HyperHELM, a framework that implements masked language model pre-training in hyperbolic space for mRNA sequences. Using a hybrid design with hyperbolic layers atop a Euclidean backbone, HyperHELM aligns learned representations with the biological hierarchy defined by the relationship between mRNA and amino acids. Across multiple multi-species datasets, it outperforms Euclidean baselines on 9 out of 10 property prediction tasks, with a 10% improvement on average, and excels in out-of-distribution generalization to long and low-GC-content sequences; for antibody region annotation, it surpasses hierarchy-aware Euclidean models by 3% in annotation accuracy. Our results highlight hyperbolic geometry as an effective inductive bias for hierarchical language modeling of mRNA sequences.

Language models have been increasingly applied to biological sequence data, fueled by the growth of large-scale omics datasets (Lin et al., 2023; Celaj et al., 2023; Brixi et al., 2025). Biological sequences, however, are structured differently from natural language, particularly in their hierarchical organization, where nucleotides or amino acids form motifs that can be nested within larger functional groups (Buhr et al., 2016). In this work, we turn to the rapidly expanding therapeutic domain of RNA, where the codon-amino acid hierarchy plays a key role in determining the biophysical properties of mRNA sequences and their expressed proteins (Clancy & Brown, 2008), and we focus on encoding this hierarchy directly into the representation space of a bio-language model by leveraging hyperbolic geometry.
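The core geometric operation behind a "hyperbolic layer atop a Euclidean backbone" is mapping Euclidean features into a hyperbolic manifold. A minimal sketch, assuming the Poincaré ball model with the standard exponential map at the origin (the paper's exact layer design and curvature are not specified here; shapes and names are illustrative):

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-9):
    """Exponential map at the origin of a Poincare ball with curvature -c:
    maps Euclidean tangent vectors into the ball (norm strictly < 1/sqrt(c))."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

# Hypothetical output of a Euclidean transformer backbone: 4 tokens, 16-dim
euclid = np.random.randn(4, 16)
hyper = expmap0(euclid, c=1.0)   # hyperbolic representation for the MLM head
```

Because `tanh` saturates below 1, every mapped vector lands inside the unit ball, where distances grow exponentially toward the boundary; this is what lets tree-like codon/amino-acid hierarchies embed with low distortion.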
Uncertainty-Guided Model Selection for Tabular Foundation Models in Biomolecule Efficacy Prediction
Li, Jie, McCarthy, Andrew, Zhang, Zhizhuo, Young, Stephen
In-context learners like TabPFN are promising for biomolecule efficacy prediction, where established molecular feature sets and relevant experimental results can serve as powerful contextual examples. However, their performance is highly sensitive to the provided context, which makes post-hoc ensembling of models trained on different data subsets a viable strategy. An open question is how to select the best models for the ensemble without access to ground-truth labels. In this study, we investigate an uncertainty-guided strategy for model selection. We demonstrate on an siRNA knockdown efficacy task that a TabPFN model using straightforward sequence-based features can surpass specialized state-of-the-art predictors. We also show that the model's predicted inter-quantile range (IQR), a measure of its uncertainty, is negatively correlated with true prediction error. We develop OligoICP, a method that selects and averages an ensemble of models with the lowest mean IQR for siRNA efficacy prediction, achieving superior performance compared to naive ensembling or a single model trained on all available data. This finding highlights model uncertainty as a powerful, label-free heuristic for optimizing biomolecule efficacy predictions.
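The selection rule the abstract describes (rank candidate models by mean predicted IQR, keep the most confident ones, average their predictions) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the model names and quantile data below are hypothetical:

```python
import numpy as np

def select_by_iqr(quantile_preds, k=2):
    """Pick the k models with the lowest mean inter-quantile range (q75 - q25),
    i.e. the least uncertain ones, and average their median predictions.
    quantile_preds: {model_name: (q25, q50, q75) arrays over test items}."""
    mean_iqr = {m: float(np.mean(q75 - q25))
                for m, (q25, q50, q75) in quantile_preds.items()}
    chosen = sorted(mean_iqr, key=mean_iqr.get)[:k]
    ensemble = np.mean([quantile_preds[m][1] for m in chosen], axis=0)
    return chosen, ensemble

# Two hypothetical TabPFN-style models: one confident, one uncertain
q = np.linspace(0.0, 1.0, 5)
preds = {"confident": (q - 0.05, q, q + 0.05),
         "wide":      (q - 0.50, q + 0.1, q + 0.50)}
chosen, ensemble = select_by_iqr(preds, k=1)
```

No labels are needed at selection time: the heuristic relies only on the models' own quantile outputs, which is what makes it usable when ground truth is unavailable.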
Curriculum-Augmented GFlowNets For mRNA Sequence Generation
Laajil, Aya, Shtanchaev, Abduragim, Muhammad, Sajan, Moulines, Eric, Lahlou, Salem
Designing mRNA sequences is a major challenge in developing next-generation therapeutics, since it involves exploring a vast space of possible nucleotide combinations while optimizing sequence properties like stability, translation efficiency, and protein expression. While Generative Flow Networks are promising for this task, their training is hindered by sparse, long-horizon rewards and multi-objective trade-offs. We propose Curriculum-Augmented GFlowNets (CAGFN), which integrate curriculum learning with multi-objective GFlowNets to generate de novo mRNA sequences. We also provide a new mRNA design environment for GFlowNets which, given a target protein sequence and a combination of biological objectives, allows for the training of models that generate plausible mRNA candidates. This provides a biologically motivated setting for applying and advancing GFlowNets in therapeutic sequence design. On different mRNA design tasks, CAGFN improves Pareto performance and biological plausibility while maintaining diversity. Moreover, CAGFN reaches higher-quality solutions faster than a GFlowNet trained with random sequence sampling (no curriculum), and generalizes to out-of-distribution sequences.

Imagine a molecule that can be designed to instruct human cells to produce a protein of interest. Such is the promise of messenger RNA (mRNA), which has become a cornerstone of modern biotechnology (Pardi et al., 2018; Sahin et al., 2014). Designing de novo mRNA sequences that encode a target protein while achieving optimality on particular properties of interest (Gustafsson et al., 2004; Kane, 1995; Mauger et al., 2019) is therefore of growing practical importance. This task can be framed as generating long, structured sequences under multiple, often competing objectives, which makes search and optimization challenging (Keeney & Raiffa, 1993; Zhang et al., 2023; Angermueller et al., 2020).
Because biological targets are diverse and downstream outcomes are difficult to predict, diversity is a central design criterion (Mullis et al., 2019). This need is amplified by the limited predictive power of inexpensive screening methods, such as in-silico simulations or in vitro assays.
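The curriculum component can be illustrated with a toy scheduler: tasks are ordered by a difficulty proxy and the sampler is promoted to a harder task once its running reward clears a threshold. This is a generic sketch of curriculum scheduling, not CAGFN's actual criterion; the task names and difficulty scores are hypothetical:

```python
class Curriculum:
    """Toy curriculum scheduler: present design tasks from easiest to hardest
    and advance one stage when the running mean reward clears a threshold."""
    def __init__(self, tasks, threshold=0.8):
        self.tasks = sorted(tasks, key=lambda t: t["difficulty"])
        self.threshold = threshold
        self.stage = 0

    def current(self):
        return self.tasks[self.stage]

    def update(self, mean_reward):
        # Promote to the next, harder task only once performance is sufficient
        if mean_reward >= self.threshold and self.stage < len(self.tasks) - 1:
            self.stage += 1

# Hypothetical mRNA design tasks ranked by target-protein difficulty
tasks = [{"name": "long_protein", "difficulty": 3},
         {"name": "short_protein", "difficulty": 1},
         {"name": "medium_protein", "difficulty": 2}]
cur = Curriculum(tasks, threshold=0.8)
```

The point of such staging is exactly the failure mode the abstract names: with sparse, long-horizon rewards, starting on easy targets gives the GFlowNet a usable training signal before it faces the full objective.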
CodonMoE: DNA Language Models for mRNA Analyses
Du, Shiyi, Liang, Litian, Li, Jiayi, Kingsford, Carl
Genomic language models (gLMs) face a fundamental efficiency challenge: either maintain separate specialized models for each biological modality (DNA and RNA) or develop large multi-modal architectures. Both approaches impose significant computational burdens - modality-specific models require redundant infrastructure despite inherent biological connections, while multi-modal architectures demand massive parameter counts and extensive cross-modality pretraining. To address this limitation, we introduce CodonMoE (Adaptive Mixture of Codon Reformative Experts), a lightweight adapter that transforms DNA language models into effective RNA analyzers without RNA-specific pretraining. Our theoretical analysis establishes CodonMoE as a universal approximator at the codon level, capable of mapping arbitrary functions from codon sequences to RNA properties given sufficient expert capacity. Across four RNA prediction tasks spanning stability, expression, and regulation, DNA models augmented with CodonMoE significantly outperform their unmodified counterparts, with HyenaDNA+CodonMoE series achieving state-of-the-art results using 80% fewer parameters than specialized RNA models. By maintaining sub-quadratic complexity while achieving superior performance, our approach provides a principled path toward unifying genomic language modeling, leveraging more abundant DNA data and reducing computational overhead while preserving modality-specific performance advantages.
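The adapter idea (regroup nucleotide-level DNA-LM embeddings into codons, then route each codon through a gated mixture of small experts) can be sketched as follows. This is a minimal illustration of a codon-level MoE, assuming simple linear experts and a softmax gate; dimensions, expert count, and the `tanh` nonlinearity are all assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def codon_moe(nt_embed, W_gate, experts):
    """Group nucleotide embeddings from a frozen DNA LM into codon triplets,
    route each codon through a softmax gate over experts, and mix outputs."""
    L, d = nt_embed.shape
    codons = nt_embed.reshape(L // 3, 3 * d)              # (n_codons, 3d)
    gates = softmax(codons @ W_gate)                      # (n_codons, n_experts)
    expert_out = np.stack([np.tanh(codons @ W) for W in experts], axis=1)
    return (gates[..., None] * expert_out).sum(axis=1)    # (n_codons, d_out)

rng = np.random.default_rng(0)
nt = rng.normal(size=(12, 8))              # 12 nt -> 4 codons, 8-dim embeddings
W_gate = rng.normal(size=(24, 3))          # gate over 3 experts
experts = [rng.normal(size=(24, 16)) for _ in range(3)]
out = codon_moe(nt, W_gate, experts)
```

The codon reshape is what injects the RNA-specific inductive bias: the DNA backbone stays frozen and modality-agnostic, while only the small adapter learns codon-to-property structure.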
Multimodal Modeling of CRISPR-Cas12 Activity Using Foundation Models and Chromatin Accessibility Data
Amirabad, Azim Dehghani, Zhang, Yanfei, Moskalev, Artem, Rajesh, Sowmya, Mansi, Tommaso, Li, Shuwei, Prakash, Mangal, Liao, Rui
Predicting guide RNA (gRNA) activity is critical for effective CRISPR-Cas12 genome editing but remains challenging due to limited data, variation across protospacer adjacent motifs (PAMs, the short sequence requirements for Cas binding), and reliance on large-scale training. We investigate whether a pre-trained biological foundation model, originally trained on transcriptomic data, can improve gRNA activity estimation even without domain-specific pre-training. Using embeddings from an existing RNA foundation model as input to a lightweight regressor, we show substantial gains over traditional baselines. We also integrate chromatin accessibility data to capture regulatory context, further improving performance. Our results highlight the effectiveness of pre-trained foundation models and chromatin accessibility data for gRNA activity prediction.
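The pipeline the abstract describes (frozen foundation-model embeddings concatenated with a chromatin accessibility feature, fed to a lightweight regressor) can be sketched with closed-form ridge regression. The embedding dimension, the single-column accessibility feature, and the synthetic data below are all illustrative assumptions; the paper's actual regressor may differ:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: a lightweight regressor on top of
    frozen embeddings, solving (X^T X + lam*I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
emb = rng.normal(size=(200, 32))          # hypothetical RNA-FM gRNA embeddings
atac = rng.normal(size=(200, 1))          # chromatin accessibility feature
X = np.concatenate([emb, atac], axis=1)   # multimodal fusion by concatenation
w_true = rng.normal(size=33)
y = X @ w_true + 0.05 * rng.normal(size=200)   # synthetic activity labels
w = ridge_fit(X, y, lam=0.1)
pred = X @ w
```

Keeping the regressor this small is the point: all representational capacity comes from the pre-trained model, so the head can be fit reliably even on the limited gRNA activity datasets the abstract mentions.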